Overview of Topics

  • Introduction
  • Motivating Example
  • Modifying Principal Component Analysis
  • Simulation of NOC Observations
  • Fault Induction
  • Simulation Study Results
  • Case Study Results
  • Software Package for R
  • Summary and References

Introduction

Introduction

  • Many factories or other closed production systems use real-time online process monitoring.
  • We aim to detect potential problems within the system before they cause damage or cause the process to shut down.
  • The process has many sensors, with a time series for each.
  • These sensor readings are not independent across time, and their values may have daily trends / perturbations (autocorrelated and non-stationary).
  • The sensors themselves are not independent (cross-correlation of the multiple time series).
  • Distributions of process variables can change dramatically with operator input.

Process Dimensionality

  • In multivariate contexts, the number of parameters to estimate increases quadratically with dimension.
  • That is, with \(p\) features we must estimate \(p + \frac{p}{2}(p + 1) \in \mathcal{O}(p^2)\) first- and second-moment parameters.
  • Linear dimension reduction (LDR) allows us to simplify complex data structures to more simple data structures via linear combinations of features.
  • LDR is built around eigen-decomposition / principal component analysis (PCA) or the singular value decomposition (SVD).

Motivating Example

Facility and Process Details

  • This work is motivated by our partnership with the Colorado School of Mines on the ReNUWit water preservation grant.
  • Our team manages a decentralised wastewater treatment (WWT) plant in Golden, CO.
  • We measure 40 features and their lagged values (80 total features).
  • The continuous features are aggregated to the minute-level.
  • We aim to develop a monitoring process capable of quickly and accurately detecting when a part of the cleaning process malfunctions before human operators.
  • Earlier work applying PCA to WWT was completed by Wise et al. (1988) and Kresta, MacGregor, and Marlin (1991).
  • More recent work applying PCA includes Baggiani and Marsili-Libelli (2009); Sanchez-Fernandez, Fuente, and Sainz-Palmero (2015); and Kazor et al. (2016).

The Bioreactor System

Multi-State Process

Example Features

Example Feature by Blower

Data Cleaning

Modifying PCA

Process Notation

  • \(\mathcal{P}\) is a real non-linear, autocorrelated, non-stationary, state-switching, \(p\)-dimensional multivariate process.
  • \(s = 1, \ldots, n\) is the observation index of \(\mathcal{P}\).
  • \(\textbf{X}_s := \left(X_1(s), X_2(s), \ldots, X_p(s)\right) \in\mathbb{R}_{1 \times p}\) is a whitened realization from \(\mathcal{P}\).
  • \(\textbf{X} := [\textbf{X}_1\ \vdots\ \textbf{X}_2\ \vdots\ \cdots\ \vdots\ \textbf{X}_n]^T \in \mathbb{R}_{n\times p}\) is the data matrix.
  • \(\textbf{X}^T\textbf{X}\in\mathbb{R}_{p \times p}^{\ge}\) is the scatter matrix of \(\textbf{X}\).
  • \(\mathbb{R}_{p \times p}^{\ge}\) is the \(p\)-dimensional cone of real, non-negative definite matrices.

PCA Basics

  • \(\textbf{X}^T\textbf{X} := \underset{p \times p}{\textbf{P}} \times \underset{p \times p}{\boldsymbol\Lambda} \times \underset{p \times p}{\textbf{P}}^T\) is the eigendecomposition of the scatter matrix.
  • \(\textbf{P}\) is the \(p \times p\) orthogonal matrix of eigenvectors.
  • \(\boldsymbol\Lambda\) is the diagonal matrix of the \(p\) eigenvalues.
  • \(\textbf{P}_d\) is the \(p\times d\) projection matrix preserving the proportion \[ \left[\sum\limits_{i = 1}^d \lambda_i\right] / \text{tr}\left(\boldsymbol\Lambda\right) \] of the total data variation (energy).
  • \(\textbf{Y}_s := \textbf{X}_s\textbf{P}_d \in\mathbb{R}_{1\times d}\) is the reduced-dimension representation of observation \(\textbf{X}_s\).
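The steps above can be sketched in a few lines; this is an illustrative numpy translation (the variable names, the toy data, and the 90% energy threshold are our choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with correlated features, standing in for process observations
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))

# Eigendecomposition of the scatter matrix X^T X
S = X.T @ X
eigvals, P = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
eigvals, P = eigvals[::-1], P[:, ::-1]  # sort descending, as in the slides

# Keep enough components to preserve, say, 90% of the total energy
energy = np.cumsum(eigvals) / eigvals.sum()
d = int(np.searchsorted(energy, 0.90)) + 1
P_d = P[:, :d]                          # the p x d projection matrix

Y = X @ P_d   # reduced-dimension representation, one row per observation
```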

Competing Methods

PCA fails because of the autocorrelation and non-linearity / non-stationarity of the data. These are a few of the methods currently employed:

  • Adaptive-Dynamic PCA (AD-PCA) of Kazor et al. (2016)
  • Kernel PCA (kPCA) of Ge, Yang, and Song (2009)
  • Adaptive kPCA (AkPCA) of Chouaib, Mohamed-Faouzi, and Messaoud (2013)
  • Local Linear Embedding (LLE) of Miao et al. (2013)
  • Multi-dimensional scaling and isometric mapping (IsoMap) of Tenenbaum, Silva, and Langford (2000)
  • Semidefinite Embedding (SDE) / Maximum Variance Unfolding (MVU) of Weinberger and Saul (2006)

We combine Adaptive, Dynamic, and Multi-State modifications to PCA.

Adaptive PCA

  • Assume \(\mathcal{P}\) is locally linear within a window of \(w + n_u\) observations.
  • Create a \(w\)-width rolling training window preceding index \(s\) by defining \[ \textbf{X}_{w} := [\textbf{X}_{s - w + 1}\ \vdots\ \textbf{X}_{s - w + 2}\ \vdots\ \cdots\ \vdots\ \textbf{X}_{s - 1}\ \vdots\ \textbf{X}_s]^T. \]
  • Calculate the scatter matrix with \(\textbf{X}_{w}\) instead of the full data matrix \(\textbf{X}\).
  • Estimate \(\textbf{P}_d\).
  • After calculating the necessary monitoring statistics on the new observations, "learn" the newest \(n_u\) observations, "forget" the oldest \(n_u\) observations, and estimate the new scatter matrix.

For an excellent summary of Adaptive and Dynamic PCA, see Ge and Song (2012).
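The learn/forget cycle above can be sketched as follows; this is an illustrative numpy version assuming a fixed window width `w` and update block size `n_update` (our name for \(n_u\)), not the package's implementation:

```python
import numpy as np

def rolling_projection(X, w, n_update, d):
    """Slide a width-w training window along X, re-estimating P_d after
    every n_update observations (learn the newest n_update, forget the
    oldest n_update)."""
    projections = []
    s = w
    while s <= X.shape[0]:
        X_w = X[s - w:s]                 # the w most recent observations
        S = X_w.T @ X_w                  # windowed scatter matrix
        eigvals, P = np.linalg.eigh(S)
        projections.append(P[:, ::-1][:, :d])   # top-d eigenvectors
        s += n_update                    # advance the window
    return projections

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4))
mats = rolling_projection(X, w=100, n_update=20, d=2)
```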

Dynamic PCA

  • Because the features are correlated over time, we include up to \(\ell\) lags per feature.
  • The observation vector at index \(s\) is now \[ \begin{align} \text{L}(\textbf{X}_s) := &[X_1(s), X_1\left( s - 1 \right), \ldots, X_1\left( s - \ell \right),\ X_2(s), X_2\left( s - 1 \right), \ldots, X_2\left( s - \ell \right), \\ &\qquad \cdots,\ X_p(s), X_p\left( s - 1 \right), \ldots, X_p\left( s - \ell \right)]. \end{align} \]
  • These \(n - \ell\) rows form the lagged data matrix \[ \text{L}(\textbf{X}) := [\text{L}(\textbf{X}_{\ell + 1})\ \vdots\ \text{L}(\textbf{X}_{\ell + 2})\ \vdots\ \cdots\ \vdots\ \text{L}(\textbf{X}_n)]^T \in\mathbb{R}_{(n - \ell) \times p(\ell + 1)}. \]
  • Calculate the scatter matrix with \(\text{L}(\textbf{X})\) instead of the data matrix \(\textbf{X}\).
  • Estimate the \(p(\ell + 1) \times d\) projection matrix \(\textbf{P}_d\).
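The lagged data matrix follows directly from the definition above; a toy numpy sketch (0-indexed, so the first usable row is the first observation with all \(\ell\) lags available):

```python
import numpy as np

def lag_matrix(X, ell):
    """Build L(X): each row stacks feature i's current value and its ell
    lagged values, feature by feature, matching the slide's ordering."""
    n, p = X.shape
    rows = []
    for s in range(ell, n):
        row = [X[s - j, i] for i in range(p) for j in range(ell + 1)]
        rows.append(row)
    return np.asarray(rows)

X = np.arange(12, dtype=float).reshape(6, 2)   # toy data: n = 6, p = 2
LX = lag_matrix(X, ell=2)   # shape (n - ell) x p(ell + 1) = (4, 6)
```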

Multi-State PCA

  • Let \(\mathcal{S}_k,\ k = 1, \ldots, K\) be a set of disjoint process states.
  • For the index \(s\), the observation \(\textbf{X}_{s}\) is said to belong to state \(\mathcal{S}_k\) if and only if \[ \textbf{X}_{s} \sim f_k\left( \boldsymbol\mu_k, \boldsymbol\Sigma_k \right). \]
  • \(f_k\) is some distribution with location parameter \(\boldsymbol\mu_k \in \mathbb{R}_{p\times 1}\) and scale parameter \(\boldsymbol\Sigma_k \in \mathbb{R}^{\ge}_{p\times p}\).
  • Because the distribution changes from state to state, the principal components are also a function of the state.
  • Partition \(\textbf{X}\) as \(\left\{\textbf{X}^{(1)}, \textbf{X}^{(2)}, \ldots, \textbf{X}^{(K)}\right\}\), where \(\textbf{X}^{(k)}\) is the ordered set of observations under state \(\mathcal{S}_k\).
  • Estimate \(K\) distinct projection matrices \(\left\{\textbf{P}_d^{(1)}, \textbf{P}_d^{(2)}, \ldots, \textbf{P}_d^{(K)}\right\}\) from the partitioned \(\textbf{X}\).
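The partition-and-estimate step above can be sketched as follows, assuming the state label of each observation is known (as it is for the WWT process); data and labels here are stand-ins:

```python
import numpy as np

def state_projections(X, states, d):
    """Estimate one d-dimensional projection matrix per process state
    from the state-partitioned data."""
    P_by_state = {}
    for k in np.unique(states):
        X_k = X[states == k]             # observations recorded under state k
        eigvals, P = np.linalg.eigh(X_k.T @ X_k)
        P_by_state[k] = P[:, ::-1][:, :d]   # top-d eigenvectors, this state
    return P_by_state

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 3))
states = np.repeat([1, 2, 3], 40)        # e.g. sequential state switching
P_d = state_projections(X, states, d=2)  # K = 3 distinct projection matrices
```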

Monitoring Statistics

These statistics are estimated non-parametrically, as discussed by Kazor et al. (2016)

Hotelling's \(T^2\)

  • The \(T^2\) statistic is the squared Mahalanobis distance of \(\textbf{Y}_{s + 1}\), the new observation mapped from the original space into the PCA subspace.
  • The \(T^2\) statistic is \(T^2_{s + 1} = \textbf{Y}_{s + 1}\boldsymbol\Lambda_d^{-1} \textbf{Y}_{s + 1}^T\), for \(\boldsymbol\Lambda_d := \text{diag}\left( \lambda_1, \lambda_2, \ldots, \lambda_d \right)\).

Squared Prediction Error

  • The SPE statistic measures the goodness-of-fit of the \(d\)-dimensional model to the \(p\)-dimensional process observations, or how well \(\textbf{P}_d\) approximates \(\textbf{P}\).
  • The SPE statistic is \(\text{SPE}(\textbf{Y}_{s + 1}) = \left( \textbf{X}_{s + 1} - \textbf{Y}_{s + 1}\textbf{P}^T_d \right) \left( \textbf{X}_{s + 1} - \textbf{Y}_{s + 1}\textbf{P}^T_d \right)^T.\)
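Both monitoring statistics follow directly from the definitions above; an illustrative numpy sketch on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4))        # stand-in training data

# Fit the PCA subspace (d = 2) from the scatter matrix
eigvals, P = np.linalg.eigh(X.T @ X)
eigvals, P = eigvals[::-1], P[:, ::-1]
d = 2
P_d, Lambda_d = P[:, :d], np.diag(eigvals[:d])

x_new = rng.standard_normal(4)           # an incoming observation
y_new = x_new @ P_d                      # mapped into the PCA subspace

# Hotelling's T^2: Mahalanobis distance in the subspace
T2 = y_new @ np.linalg.inv(Lambda_d) @ y_new

# SPE: squared norm of the reconstruction residual in the original space
resid = x_new - y_new @ P_d.T
SPE = resid @ resid
```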

Simulation of NOC Observations

Notation

To induce non-linearity and non-stationarity of each feature, let

  • the errors of \(t\), \(\epsilon_s\), be drawn from a first-order autoregressive process over \(s\), \(\epsilon_s = \varphi\epsilon_{s - 1} + \varepsilon_s\), where \(\varepsilon_s \sim N(0, \sigma^2_{\varepsilon})\),
  • \(t_s := g(s) + \epsilon_s\) be a latent feature, with \(g\) restricted to smooth and bounded functions with \(g^{\prime}(s) \ne 0\) for some \(s\), and
  • \(X_i(s) := X_i(t_s)\), where \(X_i\) is piecewise-defined over the state boundaries and smooth within those boundaries.

Then the full non-linear, autocorrelated, non-stationary simulated process at index \(s\) is \(\textbf{X}_s := \left\langle x_1 = X_1(t_s), x_2 = X_2(t_s), \ldots, x_p = X_p(t_s) \right\rangle.\)

Details of Data Generation

We follow Kazor et al. (2016), who updated the original design of Dong and McAvoy (1996). Thus

  • The process dimension \(p = 3\),
  • The cycle length is \(\omega = 7 \times 24 \times 60 = 10{,}080\), corresponding to one observation each minute over one week,
  • The observation index \(s = 1, \ldots, \omega\),
  • The observation at index \(s\) is given by \(\textbf{X}_s := \left\langle x_1 = X_1(t_s), x_2 = X_2(t_s), \ldots, x_p = X_p(t_s) \right\rangle\), and
  • The autocorrelation parameter \(\varphi = 0.75\).

Errors and Latent Feature Construction

  1. Draw the first innovation from \[ \varepsilon_1 \sim \mathcal{N}\left(\frac{1}{2}(a + b)(1 - \varphi), \frac{(b - a)^2}{12}(1 - \varphi ^ 2)\right), \] where \(a = 0.01\) and \(b = 2\). The mean and variance multipliers come from the mean and variance of the uniform\((a,b)\) distribution.
  2. Define the AR(1) error structure as \(\epsilon_s := \varphi \epsilon_{s - 1} + (1 - \varphi) \varepsilon_s,\quad s = 2, 3, \ldots\).
  3. Construct the latent feature \(t\) as \[ t_s := -\cos\left(\frac{2\pi}{\omega}s\right) + \epsilon_s,\quad s = 1, \ldots, \omega. \]
  4. Scale \(t_s\) into \([a, b]\).
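Steps 1–4 can be sketched as follows. The distribution of the innovations for \(s \ge 2\) is not fully specified above, so a standard normal is assumed as a stand-in, and the scaling into \([a, b]\) is done by a simple min–max rescale:

```python
import numpy as np

rng = np.random.default_rng(4)
omega = 7 * 24 * 60                 # one week of minute-level observations
a, b, phi = 0.01, 2.0, 0.75

# Step 1-2: AR(1) errors; first draw from the scaled normal, later
# innovations assumed standard normal (an assumption, not from the slides)
eps = np.empty(omega)
eps[0] = rng.normal(0.5 * (a + b) * (1 - phi),
                    np.sqrt((b - a) ** 2 / 12 * (1 - phi ** 2)))
for s in range(1, omega):
    eps[s] = phi * eps[s - 1] + (1 - phi) * rng.normal(0, 1)

# Step 3: one cosine cycle per week plus the AR(1) errors
s_idx = np.arange(1, omega + 1)
t = -np.cos(2 * np.pi * s_idx / omega) + eps

# Step 4: scale t into [a, b]
t = a + (b - a) * (t - t.min()) / (t.max() - t.min())
```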

Feature and State Definitions

  1. Sample three machine-error vectors of length \(\omega\), \(\textbf{e}_1, \textbf{e}_2, \textbf{e}_3\overset{i.i.d.}{\sim}\mathcal{N}(0, 0.01)\).
  2. Define the three feature vectors to be \[ \begin{align*} x(t_s) &:= t_s + \textbf{e}_1(s) \\ y(t_s) &:= t_s^2 - 3t_s + \textbf{e}_2(s) \\ z(t_s) &:= -t_s^3 + 3t_s^2 + \textbf{e}_3(s). \end{align*} \]
  3. Change states hourly in sequence, where the three process states are

    • \(\mathcal{S}_1\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{I}\)
    • \(\mathcal{S}_2\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{P}_2 \boldsymbol\Lambda_2\)
    • \(\mathcal{S}_3\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{P}_3 \boldsymbol\Lambda_3\)

Process Projections

The process-specific scaling and projection matrices are: \[ \begin{gather} \textbf{P}_2 = \begin{bmatrix} 0 & 0.50 & -0.87 \\ 0 & 0.87 & 0.50 \\ 1 & 0 & 0 \end{bmatrix} & & \boldsymbol\Lambda_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 2 \end{bmatrix} \\ & & \\ \textbf{P}_3 = \begin{bmatrix} 0 & 0.87 & -0.50 \\ -1 & 0 & 0 \\ 0 & 0.50 & 0.87 \end{bmatrix} & & \boldsymbol\Lambda_3 = \begin{bmatrix} 0.25 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.75 \end{bmatrix} \end{gather} \] \(\textbf{P}_2\) and \(\textbf{P}_3\) rotate the features so that the states are at right angles in at least one dimension. \(\boldsymbol\Lambda_2\) and \(\boldsymbol\Lambda_3\) inflate or deflate the feature variances along each principal component.
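A sketch of the feature and state construction above, using uniform stand-in values for the latent feature \(t\) and arbitrary equal blocks in place of hourly state switching:

```python
import numpy as np

rng = np.random.default_rng(5)
t = rng.uniform(0.01, 2.0, size=60)      # stand-in latent feature values

# Feature definitions from the slides, with N(0, 0.01) machine error
e = rng.normal(0, np.sqrt(0.01), size=(3, 60))
x = t + e[0]
y = t ** 2 - 3 * t + e[1]
z = -t ** 3 + 3 * t ** 2 + e[2]
F = np.column_stack([x, y, z])

P2 = np.array([[0, 0.50, -0.87], [0, 0.87, 0.50], [1, 0, 0]])
L2 = np.diag([1, 0.5, 2])
P3 = np.array([[0, 0.87, -0.50], [-1, 0, 0], [0, 0.50, 0.87]])
L3 = np.diag([0.25, 0.1, 0.75])

# Apply the state-specific rotation and rescaling per observation block
X1 = F[:20]                 # state 1: identity
X2 = F[20:40] @ P2 @ L2     # state 2: rotate, then inflate/deflate
X3 = F[40:] @ P3 @ L3       # state 3
X = np.vstack([X1, X2, X3])
```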

Example Process Graph

A single draw from the multivariate process under NOC. The vertical black line (21:40 on 2 December) marks the time at which the fault would be induced.

Monte Carlo Replication

We repeat the following process 1000 times:

  1. Draw a random set of observations:
    • one set of NOC under the single-state model,
    • one set of NOC under the multi-state model.
  2. Apply each of the six single-state and nine multi-state faults (discussed in the next section) to both NOC data sets. There will be 17 total versions of the data set.
  3. Train the fault detection system with 75% of the observations:
    • 7,560 observations for the single-state model,
    • 2,520 observations per state for the multi-state model.
  4. Calculate AD- and MSAD-PCA projections for all 17 data sets.

Performance Metrics

An observation is flagged as suspicious if the monitoring statistic is beyond the calculated non-parametric threshold. An observation will trigger an alarm if it is the \(k^{th}\) observation in a row to be flagged. For this simulation study, we triggered an alarm at the third consecutive flag.
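The \(k\)-consecutive-flags alarm rule above can be sketched as:

```python
def alarms(flags, k=3):
    """Raise an alarm at any observation that is the k-th (or later)
    consecutive flagged observation."""
    out, run = [], 0
    for f in flags:
        run = run + 1 if f else 0     # consecutive-flag counter
        out.append(run >= k)
    return out

# A lone flag never alarms; the third flag in a row does
print(alarms([1, 0, 1, 1, 1, 1], k=3))
# [False, False, False, False, True, True]
```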

For each statistic under AD- and MSAD-PCA, measure

  • The false alarm rate from the NOC data sets. This is the number of false alarms recorded in the process observations without faults.
  • The fault detection indicator for each of the 15 fault data sets.
  • The time (in minutes) until the statistic detected the fault. The earliest a statistic can trigger an alarm is after 3 minutes.

Fault Induction

Fault Overview

Induce one of the following faults at \(s = 8,500\).

Shift Faults Defined

  • Fault 1A: \(\textbf{X}^*(t_s) = \textbf{X}(t_s) + 2,\ s \ge 8500.\) This fault is a shift for all features in all states. This shift is added before state projections.

  • Fault 1B: \(x^*(t_s) = x(t_s) + 2,\ s \ge 8500.\) This fault is a shift for one feature in all states. This shift is added before state projections.

  • Fault 1C: \(x^*(t_s) = x(t_s) + 0.5,\ z^*(t_s) = z(t_s) + 0.5,\ s \ge 8500.\) This fault simulates a shift in two of the process monitoring features in \(\mathcal{S}_3\) only. This shift is added after state projections.

Fault 1A

Add 2 to all features in all states.

Fault 1B

Add 2 in all states to feature \(X\) only. This shift in feature \(X\) will infect features \(Y\) and \(Z\) through \(\textbf{P}_2\boldsymbol\Lambda_2\) and \(\textbf{P}_3\boldsymbol\Lambda_3\).

Fault 1C

Add 0.5 to features \(X\) and \(Z\) only in state \(\mathcal{S}_3\). This fault is induced after state projections.

Drift Faults Defined

  • Fault 2A: \(\textbf{X}^*(t_s) = \textbf{X}(t_s) + (s - 8500)\times 10^{-3}\), \(s > 8500.\) This fault simulates a positive drift across all the process monitoring features. This drift is added before state projections.

  • Fault 2B: \(y^*(t_s) = y(t_s) + (s - 8500)\times 10^{-3}\) and \(z^*(t_s) = z(t_s) + (s - 8500)\times 10^{-3}\), for \(s > 8500.\) This fault simulates a positive drift in two of the process monitoring features. This drift is added before state projections.

  • Fault 2C: \(y^*(t_s) = y(t_s) - 1.5 \times \frac{s - 8500}{10080 - 8500}\), for \(s > 8500.\) This fault simulates a negative drift in one of the process monitoring features in \(\mathcal{S}_2\) only. This drift is added after state projections.
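As a quick arithmetic check on the maximum drift magnitudes quoted on the following slides:

```python
omega, s0 = 10080, 8500

# Faults 2A/2B: drift rate 1e-3 per minute from s0 to the end of the cycle
max_drift_2ab = (omega - s0) * 1e-3      # 1580 minutes x 0.001 = 1.58

# Fault 2C: linear ramp reaching -1.5 at s = omega
max_drift_2c = -1.5 * (omega - s0) / (omega - s0)

print(round(max_drift_2ab, 2), max_drift_2c)
```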

Fault 2A

Drift all features in all states by a maximum \(+1.58\).

Fault 2B

Drift features \(Y\) and \(Z\) in all states by a maximum \(+1.58\). This drift in features \(Y\) and \(Z\) will infect feature \(X\) through \(\textbf{P}_2\boldsymbol\Lambda_2\) and \(\textbf{P}_3\boldsymbol\Lambda_3\).

Fault 2C

Drift feature \(Y\) by a maximum \(-1.5\) only in state \(\mathcal{S}_2\). This fault is induced after state projections.

Latent and Error Faults Defined

  • Fault 3A: \(\textbf{X}^*(t_s) = \textbf{X}(t_s^*)\), where \(t_s^* = \left[\frac{\delta(s - 8500)}{\omega - 8500} + 1\right]t_s\). For \(s > 8500,\) this fault will amplify the underlying latent effect for all features. The maximum latent drift of this fault will be \(\delta + 1\). For the single-state simulation, \(\delta + 1 = 3\); for the multi-state simulation, \(\delta + 1 = 6\). This mutation is added before state projections.

  • Fault 3B: \(z^*(t_s) = z(t_s^*)\), where \(t_s^* = \ln|t_s|\), \(s \ge 8500.\) This fault will dampen the underlying latent effect for feature \(z\) if \(t_s > 1\) and amplify this effect if \(t_s < 1\). This mutation is added before state projections.

  • Fault 3C: \(y^*(t_s) = y(t_s) + 2 \times \textbf{e}_2(s) - 0.25\), for \(s > 8500.\) This fault simulates an error increase and negative shift, but only applied to one process monitoring feature in \(\mathcal{S}_2\). This mutation is added after state projections.

Fault 3A

Drift the latent variable \(t\) by a maximum of \(+6\) for all features in all states.

Fault 3B

Mutate \(t\) by \(\log t\) in all states for feature \(Z\) only. This mutation through feature \(Z\) will infect features \(X\) and \(Y\) through \(\textbf{P}_2\boldsymbol\Lambda_2\) and \(\textbf{P}_3\boldsymbol\Lambda_3\).

Fault 3C

Double and shift by \(-0.25\) the machine-error vector on feature \(Y\) in state \(\mathcal{S}_2\) only. This fault is induced after state projections.

Simulation Study Results

Simulation Results

We present an example of parsing the following tables:

  • Consider the \(T^2\) monitoring statistic on multi-state process data after projections with the AD-PCA projection matrix.
  • This statistic has an expected ~18 false alarms per day.
  • Now consider this statistic's monitoring behavior for Fault 1C.
  • This statistic detects a fault in 96.5% of the 1000 simulation replicates.
  • When the \(T^2\) statistic detected a fault, it did so after 122 minutes, on average.
  • When the \(T^2\) statistic detected a fault, it did so within 490 minutes in 95% of cases.
  • The cell is not shaded green, so this statistic had less than median detection probability. The median detection probability was 100%.

False Alarm Rates

Detection Times

Comparison

Case Study Results

False Alarm Rates

Fault Detection Times

Software Package for R

Package mvMonitoring

  • No website: our package is currently in private beta-testing at the facility.
  • So far, engineers have been pleased with what we've developed, and they are working with us closely to polish the package.
  • mspProcessData() generates random draws from a serially autocorrelated and nonstationary multi-state (or single-state) multivariate process.
  • mspTrain() trains the projection matrix and non-parametric fault detection thresholds.
  • mspMonitor() assesses incoming process observations and classifies them as normal or abnormal.
  • mspWarning() keeps a running tally of abnormal observations and raises an alarm if necessary.
  • Alarms can be sent via email or SMS from the remote facility to the operators.

Summary and References

Summary

Future Work

References

Baggiani, F., and S. Marsili-Libelli. 2009. “Real-Time Fault Detection and Isolation in Biological Wastewater Treatment Plants.” Water Sci. Technol. 60 (11): 2949–61. doi:10.2166/wst.2009.723.

Chouaib, C., H. Mohamed-Faouzi, and D. Messaoud. 2013. “Adaptive Kernel Principal Component Analysis for Nonlinear Dynamic Process Monitoring.” In Control Conference (ASCC), 2013 9th Asian, 1–6. doi:10.1109/ASCC.2013.6606291.

Dong, Dong, and Thomas J. McAvoy. 1996. “Batch Tracking via Nonlinear Principal Component Analysis.” AIChE J. 42 (8): 2199–2208. doi:10.1002/aic.690420810.

Ge, Zhiqiang, and Zhihuan Song. 2012. Multivariate Statistical Process Control: Process Monitoring Methods and Applications. Advances in Industrial Control. Springer.

Ge, Zhiqiang, Chunjie Yang, and Zhihuan Song. 2009. “Improved Kernel PCA-Based Monitoring Approach for Nonlinear Processes.” Chem. Eng. Sci. 64 (9): 2245–55. doi:10.1016/j.ces.2009.01.050.

Kazor, Karen, Ryan W. Holloway, Tzahi Y. Cath, and Amanda S. Hering. 2016. “Comparison of Linear and Nonlinear Dimension Reduction Techniques for Automated Process Monitoring of a Decentralized Wastewater Treatment Facility.” Stoch. Environ. Res. Risk Assess. 30 (5): 1527–44. doi:10.1007/s00477-016-1246-2.

Kresta, James V., John F. MacGregor, and Thomas E. Marlin. 1991. “Multivariate Statistical Monitoring of Process Operating Performance.” Can. J. Chem. Eng. 69 (1): 35–47. doi:10.1002/cjce.5450690105.

Miao, Aimin, Zhihuan Song, Zhiqiang Ge, Le Zhou, and Qiaojun Wen. 2013. “Nonlinear Fault Detection Based on Locally Linear Embedding.” J. Control Theory Appl. 11 (4): 615–22. doi:10.1007/s11768-013-2102-2.

Sanchez-Fernandez, A., M. J. Fuente, and G. I. Sainz-Palmero. 2015. “Fault Detection in Wastewater Treatment Plants Using Distributed PCA Methods.” In 2015 IEEE 20th Conference on Emerging Technologies Factory Automation (ETFA), 1–7. doi:10.1109/ETFA.2015.7301504.

Tenenbaum, Joshua B., Vin de Silva, and John C. Langford. 2000. “A Global Geometric Framework for Nonlinear Dimensionality Reduction.” Science 290 (5500): 2319–23. doi:10.1126/science.290.5500.2319.

Weinberger, Kilian Q., and Lawrence K. Saul. 2006. “Unsupervised Learning of Image Manifolds by Semidefinite Programming.” Int. J. Comput. Vision 70 (1): 77–90. doi:10.1007/s11263-005-4939-z.

Wise, B. M., D. J. Veltkamp, B. Davis, N. L. Ricker, and B. R. Kowalski. 1988. “Principal Components Analysis for Monitoring the West Valley Liquid-Fed Ceramic Melter.” In Management of Radioactive Wastes, and Non-Radioactive Wastes from Nuclear Facilities. Vol. 20. Tucson, AZ.